Web Scraping and Cleaning¶

Wednesday 7/6/22

First Things First: Download SelectorGadget and Import Libraries¶

  • Hope you all had a good 4th of July! Did anyone do anything fun?
  • https://selectorgadget.com
  • Allows you to quickly search through the HTML of a webpage
  • If you took Stats 102A with Professor Chen you should be familiar with it
In [1]:
# For this demonstration, I will use the Team Standard Batting table; you will
# use the dataset that your team chooses
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd

Use requests to Load the Webpage into Python¶

Go to https://baseball-reference.com

BR URLs.png

The URL of this page should be: https://www.baseball-reference.com/leagues/majors/2021.shtml

In [2]:
team_stats_Request = requests.get('https://www.baseball-reference.com/leagues/majors/2021.shtml') 
# use requests to get the website
In [3]:
type(team_stats_Request) 
Out[3]:
requests.models.Response
In [4]:
print(team_stats_Request) # as you can see, you can't do much with a raw Response object
<Response [200]>

Note that the URL contains 2021. To find every season's web page from 2000 to 2021, loop over the URL by making it an f-string in which the year changes with each iteration.
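As a minimal sketch of that idea (just building the URLs here, without actually requesting any of them, and assuming the same URL pattern holds for every season):

```python
# Build each season's URL with an f-string; range() is exclusive on the
# right, so range(2000, 2022) covers 2000 through 2021
urls = [
    f'https://www.baseball-reference.com/leagues/majors/{year}.shtml'
    for year in range(2000, 2022)
]
print(urls[0])    # https://www.baseball-reference.com/leagues/majors/2000.shtml
print(len(urls))  # 22 seasons
```

Inside the real loop you would call requests.get() on each of these URLs (see the note at the end about not requesting too often).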

Convert the Request to a BeautifulSoup Object¶

In [5]:
team_stats_soup = BeautifulSoup(team_stats_Request.text, 'html.parser') # specifying a parser avoids a warning
In [6]:
type(team_stats_soup) # the object we created with BeautifulSoup() is an object of type BeautifulSoup
Out[6]:
bs4.BeautifulSoup

Find Your HTML Nodes Using SelectorGadget¶

Open the SelectorGadget link, and a small toolbar should appear at the bottom of your screen. Every time you hover over a part of the website, a box should appear around that part.

SelectorGadget 1.png

Keep clicking on either (1) unhighlighted parts you do want or (2) highlighted parts you don't want until only what you want is selected.

SelectorGadget 2.png

SelectorGadget 3.png

As you can see, only the batting table is selected

Now, copy the HTML nodes that SelectorGadget gives you

SelectorGadget 4.png

For this table, the nodes are: #teams_standard_batting .left, #teams_standard_batting .right, #teams_standard_batting .center

Use the select() method on a BeautifulSoup Object with SelectorGadget to Access Website Text¶

In [7]:
hittingTable = team_stats_soup.select(
'#teams_standard_batting .center , #teams_standard_batting .left, #teams_standard_batting .right'
) # plug the nodes into select() as a single comma-separated string

The object hittingTable is an iterable containing each cell of the table. To access the information in each cell, loop over the cells and use the text attribute. For simplicity, I will use the first 3 elements of hittingTable to demonstrate.

In [8]:
hittingTable[0:3] # as you can see, directly accessing the object is not useful
Out[8]:
[<th aria-label="Tm" class="poptip sort_default_asc left" data-stat="team_name" scope="col">Tm</th>,
 <th aria-label="#Bat" class="poptip center" data-stat="batters_used" data-tip="&lt;strong&gt;Number of Players used in Games&lt;/strong&gt;" scope="col">#Bat</th>,
 <th aria-label="BatAge" class="poptip sort_default_asc center" data-stat="age_bat" data-tip="&lt;strong&gt;Batters&amp;#x2019; average age&lt;/strong&gt;&lt;br&gt;Weighted by AB + Games Played" scope="col">BatAge</th>]
In [9]:
hitting_table_elements = []
for element in hittingTable[0:3]:
    hitting_table_elements.append(element.text)
print(hitting_table_elements) # creating a loop and iterating through the elements is the more useful route
['Tm', '#Bat', 'BatAge']

Final Result¶

Using the data that can be accessed from the BeautifulSoup object as shown above, and iterating through the table from each year from 2000 to 2021, your final product should look like this:

Final Result 1.png

It should be a pandas DataFrame containing the year of each team (which is not in the Baseball Reference tables; you will have to add it yourself) and every team's stats from every year between 2000 and 2021.
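A minimal sketch of that reshaping step, using made-up cell values and a hypothetical 3-column layout (the real table has many more columns):

```python
import pandas as pd

# Once every cell's text is in a flat list, slice off the header row and
# chop the remaining cells into rows using the number of columns
cell_texts = ['Tm', '#Bat', 'BatAge',   # header row
              'ARI', '53', '27.6',      # one team row (made-up values)
              'ATL', '48', '28.4']      # another team row (made-up values)
n_cols = 3
header, body = cell_texts[:n_cols], cell_texts[n_cols:]
rows = [body[i:i + n_cols] for i in range(0, len(body), n_cols)]
df = pd.DataFrame(rows, columns=header)
df['Year'] = 2021  # the year is not in the scraped table, so add it yourself
```

Repeating this for each season and concatenating the per-season DataFrames (e.g. with pd.concat) gives the final product.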

Sike! This Is the Actual Final Result!¶

Final Result 2.png

  • There is another Baseball Reference webpage that lists who made the playoffs during every season - this can be used to add a column showing who made the playoffs (1) and who did not (0)
  • This is a bit challenging to do so I will be providing the code for how to add this on, but feel free to try to do it on your own for an extra challenge
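If you want to attempt it yourself, one possible approach is to check each (year, team) pair against a set of playoff teams scraped from that page. The set and team abbreviations below are made up for illustration:

```python
import pandas as pd

# Hypothetical set of (year, team) pairs scraped from the playoffs page
playoff_teams = {(2021, 'ATL'), (2021, 'HOU')}

# A tiny stand-in for the full stats DataFrame
df = pd.DataFrame({'Year': [2021, 2021, 2021],
                   'Tm': ['ATL', 'HOU', 'ARI']})

# Flag each row: 1 if the team made the playoffs that year, 0 otherwise
df['Playoffs'] = [
    int((year, team) in playoff_teams)
    for year, team in zip(df['Year'], df['Tm'])
]
```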

Important Thing to Note Before You Start¶

  • Requesting and/or iterating through multiple web pages will require Python to send requests to a website over and over again
  • If you request a website too many times in a short period of time, your IP Address could be banned from the website

Here are two important guidelines to follow so that your IP address does not get banned:

  1. Try getting the correct DataFrame for one season first, then create your final loop to loop over all seasons. This way, you won't have to request 21 web pages every time you run your code.
  2. When you are testing your code on one of the seasons (before the final loop), execute the request command in a separate cell so that you don't have to run the request over and over again. Once you save the response to an object, you don't need to run the request again.
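Another way to be polite to the server when you do run the final loop is to pause between requests. This sketch uses a stand-in fetch function (which just records the URL) so the pattern can run without touching the network; in your code, fetch would be requests.get:

```python
import time

fetched = []

def fetch(url):
    # Stand-in for requests.get(url); records the URL instead of requesting it
    fetched.append(url)

for year in range(2000, 2003):  # use the full 2000-2021 range in practice
    fetch(f'https://www.baseball-reference.com/leagues/majors/{year}.shtml')
    time.sleep(1)  # wait a second between requests (adjust as needed)
```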

Here Are The Teams!¶

Hitting   Pitching   Fielding
Arnav     Anish      Nicole
Victor    Aaron      Maddie
Avnish    Vince
  • I will create groupchats after the meeting for each team
  • This is where your team should decide who will take on which dataset, and where you will communicate

Hitting Team Datasets¶

Hitting.png

Pitching Team Datasets¶

Pitching.png

Fielding Team Datasets¶

Fielding.png

To Do By Next Week's Meeting¶

  1. Discuss with your team to decide what dataset you will be web scraping
  2. Complete your code and upload it to your team's folder in the Google Drive
  3. Fill out the When2Meet for next week
  • I am still deciding, but next week's meeting might be optional as a work session
  • If you have any questions, please let me or your teammates know
  • I will check in with each team sometime early next week
  • Good luck and I will see you all next week!